Abstract

This paper presents an algorithm for efficient training of sparse linear models
with elastic net regularization. Extending previous work on delayed updates,
the new algorithm applies stochastic gradient updates to non-zero features only,
bringing weights current as needed with closed-form updates. Closed-form delayed
updates for the $\ell_1$, $\ell_\infty$, and rarely used $\ell_2$ regularizers have been described
previously. This paper provides closed-form updates for the popular squared norm
$\ell_2^2$ and elastic net regularizers. We provide dynamic programming algorithms that
perform each delayed update in constant time. The new $\ell_2^2$ and elastic net methods
handle both fixed and varying learning rates, and both standard stochastic gradient
descent (SGD) and forward backward splitting (FoBoS). Experimental results
show that on a bag-of-words dataset with 260,941 features, but only 88 nonzero
features on average per training example, the dynamic programming method trains
a logistic regression classifier with elastic net regularization over 2000 times faster
than an otherwise identical implementation that does not exploit sparsity.


1 Introduction 

For many applications of linear classification or linear regression, training and test 
examples are sparse, and with appropriate regularization, a final trained model can be 
sparse and still achieve high accuracy. It is therefore desirable to be able to train linear 
models using algorithms that require time that scales only with the number of non-zero 
feature values. 

Incremental training algorithms such as stochastic gradient descent (SGD) are widely
used to learn high-dimensional models from large-scale data. These methods process
examples one at a time or in small batches, updating the model on the fly. When a
dataset is sparse and the loss function is not regularized, this sparsity can be exploited
by updating only the weights corresponding to non-zero feature values for each
example. However, to prevent overfitting to high-dimensional data, it is often useful to
apply a regularization penalty, and specifically to impose a prior belief that the true
model parameters are sparse and small in magnitude. Unfortunately, widely used
regularizers such as $\ell_1$ (lasso), $\ell_2^2$ (ridge), and elastic net ($\ell_1 + \ell_2^2$) destroy the sparsity of
the stochastic gradient for each example, so they seemingly require most weights to be
updated for every example.

This paper builds upon methods for delayed updating of weights first described
in earlier work, including [2] and [6]. As each training example is processed, the algorithm updates only
those weights corresponding to non-zero feature values in the example. The model is 
brought current as needed by first applying closed-form constant-time delayed updates 
for each of these weights. For sparse data sets, the algorithm runs in time independent 
of the nominal dimensionality, scaling linearly with the number of non-zero feature 
values per example. 

To date, constant-time delayed update formulas have been derived only for the
$\ell_1$ and $\ell_\infty$ regularizers, and for the rarely used $\ell_2$ regularizer. We extend previous work
by showing the proper closed form updates for the popular $\ell_2^2$ squared norm regularizer
and for elastic net regularization. When the learning rate varies (typically decreasing as
a function of time), we show that the elastic net update can be computed with a dynamic
programming algorithm that requires only constant-time computation per update.¹

A straightforward experimental implementation of the proposed methods shows
that on a representative dataset containing the abstracts of a million articles from
biomedical literature, we can train a logistic regression classifier with elastic net
regularization over 2000 times faster than using an otherwise identical implementation that
does not take advantage of sparsity. Even if the standard implementation exploits
sparsity when making predictions during training, additionally exploiting sparsity when
doing updates, via dynamic programming, still makes learning 1400 times faster.

2 Background and Definitions 

We consider a data matrix $X \in \mathbb{R}^{n \times d}$ where each row $x_i$ is one of $n$ examples and
each column, indexed by $j$, corresponds to one of $d$ features. We desire a linear model
parametrized by a weight vector $w \in \mathbb{R}^d$ that minimizes a convex objective function
$F(w)$ expressible as $\sum_{i=1}^{n} F_i(w)$, where $F$ is the loss with respect to the entire dataset
$X$ and $F_i$ is the loss due to example $x_i$.

In many datasets, the vast majority of feature values $x_{ij}$ are zero. The bag-of-words
representation of text is one such case. We say that such datasets are sparse. When
features correspond to counts or to binary values, as in bag-of-words, we sometimes
say that a zero-valued entry $x_{ij}$ is absent. We use $p$ to refer to the average number of
nonzero features per example. Naturally, when a dataset is sparse, we prefer algorithms
that take time $O(p)$ per example to those that require time $O(d)$.

1 The dynamic programming algorithms below for delayed updates with varying learning rates use time
$O(1)$ per update, but have space complexity $O(T)$ where $T$ is the total number of stochastic gradient updates.
If this space complexity is too great, that problem can be solved by allotting a fixed space budget and bringing
all weights current whenever the budget is exhausted. As the cost of bringing weights current is amortized
across many updates, it adds negligibly to the total running time.

2.1 Regularization

To prevent overfitting, regularization restricts the freedom of a model's parameters,
penalizing their distance from some prior belief. Widely used regularizers penalize
large weights with an objective function of the form

$$F(w) = L(w) + R(w). \qquad (1)$$

Many commonly used regularizers $R(w)$ are of the form $\lambda \|w\|$, where $\lambda$ determines
the strength of regularization and the $\ell_1$, $\ell_2$, or $\ell_\infty$ norms are common choices for
the penalty function. The $\ell_1$ regularizer is popular owing to its tendency to produce
sparse models. In this paper, we focus on elastic net, a linear combination of $\ell_1$ and $\ell_2^2$
regularization that has been shown to produce comparably sparse models to $\ell_1$ while
often achieving superior accuracy.
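
As a small illustration, the elastic net penalty can be written directly as a function of the weights. The sketch below assumes the scaling $\lambda_1 \|w\|_1 + (\lambda_2/2) \|w\|_2^2$ used for the elastic net objective in Section 5.3; the function name and the plain-list representation of $w$ are illustrative choices, not part of the implementation evaluated in this paper.

```python
def elastic_net_penalty(w, lam1, lam2):
    """Return lam1 * ||w||_1 + (lam2 / 2) * ||w||_2^2 for a list of weights.

    The factor 1/2 on the squared-norm term is an assumption taken from the
    elastic net objective in Section 5.3.
    """
    l1 = sum(abs(x) for x in w)
    l2_squared = sum(x * x for x in w)
    return lam1 * l1 + 0.5 * lam2 * l2_squared


# Setting lam2 = 0 recovers the lasso penalty; lam1 = 0 recovers ridge.
penalty = elastic_net_penalty([3.0, -4.0], lam1=0.1, lam2=0.2)
```

For the weights above the penalty is $0.1 \cdot 7 + 0.1 \cdot 25 = 3.2$.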

2.2 Stochastic Gradient Descent 

Gradient descent is a common strategy to learn optimal parameter values $w$. To minimize
$F(w)$, a number of steps $T$, indexed by $t$, are taken in the direction of the negative
gradient:

$$w^{(t+1)} := w^{(t)} - \eta \sum_{i=1}^{n} \nabla F_i(w^{(t)})$$

where the learning rate $\eta$ may be a function of time $t$. An appropriately decreasing $\eta$
ensures that the algorithm will converge to a vector $w$ within distance $\epsilon$ of the optimal
vector for any small value $\epsilon$.

Traditional (or “batch”) gradient descent requires a pass through the entire dataset
for each update. Stochastic gradient descent (SGD) circumvents this problem by updating
the model once after visiting each example. With SGD, examples are randomly
selected one at a time or in small so-called mini-batches. For simplicity of notation,
without loss of generality we will assume that examples are selected one at a time. At
time $t + 1$ the gradient $\nabla F_i(w^{(t)})$ is calculated with respect to the selected example
$x_i$, and then the model is updated according to the rule

$$w^{(t+1)} := w^{(t)} - \eta \nabla F_i(w^{(t)}).$$

Because the examples are chosen randomly, the expected value of this noisy gradient 
is identical to the true value of the gradient taken with respect to the entire corpus. 
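
To make the update rule concrete, the following sketch applies one unregularized SGD step for logistic regression to a single sparse example. It assumes labels in {0, 1} and a dictionary holding only the nonzero features; these choices, and the name `sgd_step`, are illustrative assumptions rather than details taken from this paper.

```python
import math


def sgd_step(w, x, y, eta):
    """One unregularized SGD step for logistic regression.

    w:   list of d weights, modified in place
    x:   dict {feature index: value} holding only the nonzero features
    y:   label in {0, 1} (an assumption of this sketch)
    eta: learning rate
    """
    z = sum(w[j] * v for j, v in x.items())
    p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
    for j, v in x.items():              # only nonzero features are touched
        w[j] -= eta * (p - y) * v
    return w


w = sgd_step([0.0] * 5, {1: 2.0, 3: 1.0}, y=1, eta=0.5)
```

Only the coordinates 1 and 3 change, so the cost of the step depends on the number of nonzero features and not on the length of `w`.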

Given a continuously differentiable convex objective function $F(w)$, stochastic
gradient descent is known to converge for learning rates $\eta$ that satisfy $\sum_t \eta_t = \infty$ and
$\sum_t \eta_t^2 < \infty$. Learning rates $\eta_t \propto 1/t$ satisfy both properties; the widely used schedule
$\eta_t \propto 1/\sqrt{t}$ satisfies the first but not the second, because $\sum_t 1/t$ diverges.²
For many objective functions, such as those of linear or logistic regression without
regularization, the noisy gradient $\nabla F_i(w)$ is sparse when the input is sparse. In these
cases, one needs only to update the weights corresponding to non-zero features in the
current example $x_i$. These updates require time $O(p)$, where $p \ll d$ is the average
number of nonzero features in an example.

2 Some common objective functions, such as those involving $\ell_1$ regularization, are not differentiable
when weights are equal to zero. However, forward backward splitting (FoBoS) offers a principled approach
to this problem.

Regularization, however, can ruin the sparsity of the gradient. Consider an objective
function as in Equation (1), where $R(w) = \|w\|$ for some norm $\|\cdot\|$. In these
cases, even when the feature value $x_{ij} = 0$, the partial derivative $(\partial/\partial w_j) F_i$ is nonzero
owing to the regularization penalty if $w_j$ is nonzero. A simple optimization is to update
a weight only when either the weight or feature is nonzero. Given feature sparsity
and persistent model sparsity throughout training, not updating $w_j$ when $w_j = 0$ and
$x_{ij} = 0$ provides a substantial benefit. But such an approach still scales with the size
of the model, which may be far larger than $p$. In contrast, the algorithms below scale
in time complexity $O(p)$.


2.3 Forward Backward Splitting 

Proximal algorithms are an approach to optimization in which each update consists
of solving a convex optimization problem [9]. Forward Backward Splitting (FoBoS)
is a proximal algorithm that provides a principled approach to online optimization
with non-smooth regularizers. We first step in the direction of the negative gradient of 
the differentiable unregularized loss function. We then update the weights by solving 
a convex optimization problem that simultaneously penalizes distance from the new 
parameters and minimizes the regularization term. 

In FoBoS, first a standard unregularized stochastic gradient step is applied:

$$w^{(t+\frac{1}{2})} = w^{(t)} - \eta^{(t)} \nabla L_i(w^{(t)}).$$

Note that if $(\partial/\partial w_j) L_i = 0$ then $w_j^{(t+\frac{1}{2})} = w_j^{(t)}$. Then a convex optimization is
solved, applying the regularization penalty. For elastic net the problem to be solved is

$$w^{(t+1)} = \operatorname{argmin}_w \left( \frac{1}{2} \|w - w^{(t+\frac{1}{2})}\|_2^2 + \eta^{(t)} \lambda_1 \|w\|_1 + \frac{1}{2} \eta^{(t)} \lambda_2 \|w\|_2^2 \right). \qquad (2)$$

The problems corresponding to $\ell_1$ or $\ell_2^2$ separately can be derived by setting the
corresponding $\lambda$ to 0.
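
Because the elastic net problem separates across coordinates, each coordinate can be minimized in closed form. The sketch below is one way to write that scalar step, assuming the per-coordinate objective $\frac{1}{2}(w - v)^2 + \eta \lambda_1 |w| + \frac{1}{2} \eta \lambda_2 w^2$ with $v$ the value after the gradient step; the function names are hypothetical.

```python
def fobos_elastic_net_step(v, eta, lam1, lam2):
    """Closed-form minimizer over w of
    0.5*(w - v)**2 + eta*lam1*abs(w) + 0.5*eta*lam2*w**2  (scalar v)."""
    shrunk = max(abs(v) - eta * lam1, 0.0) / (1.0 + eta * lam2)
    return shrunk if v >= 0 else -shrunk


def objective(w, v, eta, lam1, lam2):
    """The scalar objective that the step above is meant to minimize."""
    return 0.5 * (w - v) ** 2 + eta * lam1 * abs(w) + 0.5 * eta * lam2 * w * w


# Small inputs are clipped to exactly zero, which is what keeps models sparse.
w_small = fobos_elastic_net_step(0.05, eta=0.1, lam1=1.0, lam2=2.0)
w_large = fobos_elastic_net_step(1.5, eta=0.1, lam1=1.0, lam2=2.0)
```

Setting `lam2 = 0` gives the familiar soft-thresholding step for $\ell_1$, and setting `lam1 = 0` gives a pure multiplicative shrinkage for $\ell_2^2$.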


3 Lazy Updates 

The idea of lazy updating was introduced in earlier work, including [2] and [6]. This paper extends the
idea to the cases of $\ell_2^2$ and elastic net regularization. The essence of the approach is
given in Algorithm 1. We maintain an array $\psi \in \mathbb{R}^d$ in which each $\psi_j$ stores the
index of the last iteration at which the value of weight $j$ was current. When processing
example $x_i$ at time $k$, we iterate through its nonzero features $x_{ij}$. For each such
nonzero feature, we lazily apply the $k - \psi_j$ delayed updates collectively in constant
time, bringing its weight $w_j$ current. Using the updated weights, we compute the
prediction $\hat{y}^{(k)}$ with the fully updated relevant parameters from $w^{(k)}$. We then compute
the gradient and update these parameters.

Algorithm 1 Lazy Updates

Require: $\psi \in \mathbb{R}^d$
for $t \in 1, \ldots, T$ do
    Sample $x_i$ randomly from the training set
    for $j$ s.t. $x_{ij} \neq 0$ do
        $w_j \leftarrow \mathrm{Lazy}(w_j, t, \psi_j)$
        $\psi_j \leftarrow t$
    end for
    $w \leftarrow w - \nabla F_i(w)$
end for


When training is complete, we pass once over all nonzero weights to apply the
delayed updates to bring the model current. Provided that we can apply any number
$k - \psi_j$ of delayed updates in $O(1)$ time, the algorithm processes each example in $O(p)$
time regardless of the dimension $d$.
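
The loop of Algorithm 1 can be sketched in a few lines. The version below is a minimal illustration under several assumptions that are not taken from this paper: a fixed learning rate, an $\ell_2^2$ penalty only (so that the lazy update is a power of a constant shrink factor), logistic loss with labels in {0, 1}, examples visited in the given order rather than sampled, and the regularization step for iteration $t$ applied after the loss-gradient step for iteration $t$. A dense reference that shrinks every weight at every step is included for comparison.

```python
import math


def loss_step(w, x, y, eta):
    """Logistic-loss gradient step on the nonzero features of x only."""
    z = sum(w[j] * v for j, v in x.items())
    p = 1.0 / (1.0 + math.exp(-z))
    for j, v in x.items():
        w[j] -= eta * (p - y) * v


def train_lazy(data, d, eta, lam2):
    """Lazy updates: each weight is regularized only when its feature appears."""
    w, psi = [0.0] * d, [0] * d        # psi[j]: iteration at which w[j] was last current
    shrink = 1.0 - eta * lam2
    for t, (x, y) in enumerate(data):
        for j in x:                    # Lazy(w_j, t, psi_j): t - psi_j delayed shrinks
            w[j] *= shrink ** (t - psi[j])
            psi[j] = t
        loss_step(w, x, y, eta)
    for j in range(d):                 # final pass brings every weight current
        w[j] *= shrink ** (len(data) - psi[j])
    return w


def train_dense(data, d, eta, lam2):
    """Reference: the same updates, but all d weights are shrunk at every step."""
    w = [0.0] * d
    for x, y in data:
        loss_step(w, x, y, eta)
        w = [wj * (1.0 - eta * lam2) for wj in w]
    return w
```

Under these assumptions both functions return the same weights up to floating-point rounding, while `train_lazy` touches only the nonzero features of each example inside the main loop.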

To use the approach with a chosen regularizer, it remains only to demonstrate the
existence of constant time updates. In the following subsections, we derive constant-time
updates for $\ell_1$, $\ell_2^2$ and elastic net regularization, starting with the simple case
where the learning rate $\eta$ is fixed during each epoch, and extending to the more
complicated case when the learning rate is decreased every iteration as a function of time.³

4 Prior Work 

Over the last several years, a large body of work has advanced the field of online
learning. Notable contributions include ways of adaptively decreasing the learning rate
separately for each parameter such as AdaGrad and AdaDelta, using small
batches to reduce the variance of the noisy gradient, and other variance reduction
methods such as Stochastic Average Gradient (SAG) and Stochastic Variance Reduced
Gradient (SVRG).

In 2008, Carpenter described an idea for performing lazy updates for stochastic
gradient descent. With that method, we maintain a vector $\psi \in \mathbb{N}^d$, where each $\psi_i$
stores the index of the last epoch in which weight $i$ was regularized. We then
perform periodic batch updates. However, as the paper acknowledges, the approach
described results in updates that do not produce the same result as applying an update
after each time step.

Langford et al. concurrently developed an approach for lazily updating regularized
linear models [6]. They restrict attention to $\ell_1$ models. Additionally, they describe
the closed form update only when the learning rate $\eta$ is constant, although they suggest
that an update can be derived when $\eta$ decays as $t$ grows large. We derive constant-time
updates for $\ell_2^2$ and elastic net regularization. Our algorithms are applicable with both
fixed and varying learning rates.

3 The results hold for schedules of learning rate decrease that depend on time, but cannot be directly applied to
AdaGrad or RMSprop, methods where each weight has its own learning rate which is decreased with the
inverse of the accumulated sum (or moving average) of squared gradients with respect to that weight.


In 2008 also, as mentioned above, Duchi and Singer described the FoBoS method.
They share the insight of applying updates lazily when training on sparse high-dimensional
data. Their lazy updates hold for norms $\ell_q$ for $q \in \{1, 2, \infty\}$. However
they do not hold for the commonly used $\ell_2^2$ squared norm. Consequently they also
do not hold for mixed regularizers involving $\ell_2^2$ such as the widely used elastic net
($\ell_1 + \ell_2^2$).


5 Constant-Time Lazy Updating for SGD 

In this section, we derive constant-time stochastic gradient updates for use when
processing examples from sparse datasets. Using these, the lazy update algorithm can
train linear models with time complexity $O(p)$ per example. For brevity, we describe
the more general case where the learning rate is varied. When the learning rate is
constant the algorithm can be easily modified to have $O(1)$ space complexity.


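5.1 Lazy $\ell_1$ Regularization with Decreasing Learning Rate

The closed-form update for $\ell_1$ regularized models is

$$w_j^{(k)} = \operatorname{sgn}(w_j^{(\psi_j)}) \left[ |w_j^{(\psi_j)}| - \lambda_1 \left( S(k-1) - S(\psi_j - 1) \right) \right]_+$$

where $[z]_+ = \max(z, 0)$ and $S(t)$ is a function that returns the partial sum $\sum_{\tau=0}^{t} \eta^{(\tau)}$,
with $S(-1) = 0$. The sum $\sum_{\tau=t}^{t+n-1} \eta^{(\tau)} = S(t+n-1) - S(t-1)$
can be computed in constant time using a caching approach. On each iteration $t$, we
compute $S(t)$ in constant time given its predecessor as $S(t) = \eta^{(t)} + S(t-1)$. The
base case for this recursion is $S(0) = \eta^{(0)}$. We then cache this value in an array for
subsequent constant time lookup.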
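
The caching scheme for the $\ell_1$ case can be checked numerically. The following sketch stores the partial sums in an array shifted by one position, so that index 0 plays the role of $S(-1) = 0$, and compares the constant-time update with a step-by-step application of the clipped $\ell_1$ update. The helper names, the array offset, and the example schedule $\eta^{(t)} = 1/(t+1)$ are assumptions of the sketch.

```python
import math


def cache_sums(etas):
    """Return S with S[t + 1] = eta_0 + ... + eta_t and S[0] = 0 (i.e. S(-1))."""
    S = [0.0]
    for eta in etas:
        S.append(S[-1] + eta)
    return S


def l1_lazy(w, psi, k, S, lam1):
    """Bring w from iteration psi to k in constant time using the cached sums."""
    mag = max(abs(w) - lam1 * (S[k] - S[psi]), 0.0)
    return math.copysign(mag, w)


def l1_stepwise(w, psi, k, etas, lam1):
    """Reference: apply the k - psi clipped updates one at a time."""
    for t in range(psi, k):
        w = math.copysign(max(abs(w) - etas[t] * lam1, 0.0), w)
    return w


etas = [1.0 / (t + 1) for t in range(50)]
S = cache_sums(etas)
lazy_value = l1_lazy(-0.3, 0, 50, S, lam1=0.05)
```

A weight that is clipped to zero part way through the delayed interval stays at zero in both versions, which is why a single clip at the end of the lazy update suffices.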

When the learning rate decays with $1/t$, the terms $\eta^{(\tau)}$ follow the harmonic series.
Each partial sum of the harmonic series is a harmonic number $H(t) = \sum_{i=1}^{t} 1/i$.
Clearly, taking $\eta^{(\tau)} = \eta^{(0)}/(\tau + 1)$,

$$\sum_{\tau=t}^{t+n-1} \eta^{(\tau)} = \eta^{(0)} \left( H(t+n) - H(t) \right)$$

where $H(t)$ is the $t$th harmonic number. While there is no closed-form expression to
calculate the $t$th harmonic number, there exist good approximations.
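
One standard approximation is the asymptotic expansion $H(t) \approx \ln t + \gamma + 1/(2t) - 1/(12 t^2)$, where $\gamma$ is the Euler-Mascheroni constant. The sketch below uses it to estimate a sum of learning rates without caching, assuming the schedule $\eta^{(\tau)} = \eta^{(0)}/(\tau + 1)$; the function names and that schedule are assumptions of the sketch, and this paper does not state which approximation, if any, was used in its experiments.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def harmonic_approx(t):
    """Asymptotic approximation of H(t) = 1 + 1/2 + ... + 1/t for t >= 1."""
    return math.log(t) + EULER_GAMMA + 1.0 / (2 * t) - 1.0 / (12 * t * t)


def harmonic_exact(t):
    """Exact harmonic number, computed in O(t) time for comparison."""
    return sum(1.0 / i for i in range(1, t + 1))


def learning_rate_sum(eta0, t, n):
    """Approximate sum over tau = t, ..., t+n-1 of eta0 / (tau + 1), for t >= 1."""
    return eta0 * (harmonic_approx(t + n) - harmonic_approx(t))
```

The leading error term of this expansion is of order $1/t^4$, so the approximation is already accurate for moderate $t$ and needs no storage.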

The $O(T)$ space complexity of this algorithm may seem problematic. However,
this problem is easily dealt with by bringing all weights current after each epoch. The
cost to do so is amortized across all iterations and is thus negligible.


5.2 Lazy $\ell_2^2$ Regularization with Decreasing Learning Rate

For a given example $x_i$, if the feature value $x_{ij} = 0$ and the learning rate is varying,
then the stochastic gradient update rule for an $\ell_2^2$ regularized objective is

$$w_j^{(t+1)} = w_j^{(t)} - \eta^{(t)} \lambda_2 w_j^{(t)}.$$

The decreasing learning rate prevents collecting successive updates as terms in a geometric
series, as we could if the learning rate were fixed. However, we can employ a
dynamic programming strategy.

Lemma 1. For SGD with $\ell_2^2$ regularization, the constant-time lazy update to bring a
weight current from iteration $\psi_j$ to $k$ is

$$w_j^{(k)} = w_j^{(\psi_j)} \frac{P(k-1)}{P(\psi_j - 1)}$$

where $P(t)$ is the partial product $\prod_{\tau=0}^{t} (1 - \eta^{(\tau)} \lambda_2)$, with the convention $P(-1) = 1$.

Proof. Rewriting the multiple update expression yields

$$w_j^{(t+n)} = w_j^{(t)} (1 - \eta^{(t)} \lambda_2)(1 - \eta^{(t+1)} \lambda_2) \cdots (1 - \eta^{(t+n-1)} \lambda_2).$$

The products $P(t) = \prod_{\tau=0}^{t} (1 - \eta^{(\tau)} \lambda_2)$ can be cached on each iteration in constant
time using the recursive relation

$$P(t) = (1 - \eta^{(t)} \lambda_2) \cdot P(t-1).$$

The base case is $P(0) = a_0 = (1 - \eta^{(0)} \lambda_2)$. Given cached values $P(0), \ldots, P(t+n)$, it
is then easy to calculate the exact lazy update in constant time:

$$w_j^{(t+n)} = w_j^{(t)} \frac{P(t+n-1)}{P(t-1)}.$$

The claim follows. □

As in the case of $\ell_2^2$ regularization with fixed learning rate, we need not worry that
the regularization update will flip the sign of the weight $w_j$, because $P(t) > 0$ for all
$t \geq 0$, provided that $\eta^{(\tau)} \lambda_2 < 1$ for every $\tau$.
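
Lemma 1 can be exercised with a short sketch. The partial products are stored in an array shifted by one position so that index 0 holds $P(-1) = 1$; the lazy update is then a single multiplication and division. The helper names, the offset, and the example schedule are assumptions of the sketch.

```python
def cache_products(etas, lam2):
    """Return P with P[t + 1] = prod_{tau <= t} (1 - eta_tau * lam2), P[0] = 1."""
    P = [1.0]
    for eta in etas:
        P.append(P[-1] * (1.0 - eta * lam2))
    return P


def l2_lazy(w, psi, k, P):
    """Bring w from iteration psi to k: w * P(k - 1) / P(psi - 1)."""
    return w * P[k] / P[psi]


def l2_stepwise(w, psi, k, etas, lam2):
    """Reference: apply the k - psi multiplicative updates one at a time."""
    for t in range(psi, k):
        w -= etas[t] * lam2 * w
    return w


etas = [0.5 / (t + 1) for t in range(40)]
P = cache_products(etas, lam2=0.1)
lazy_value = l2_lazy(1.5, 5, 12, P)
```

Because every factor is positive when $\eta^{(\tau)} \lambda_2 < 1$, the cached products never reach zero and the division is safe for the schedule used here.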


5.3 Lazy Elastic Net Regularization with Decreasing Learning Rate 

Next, we derive the constant time lazy update for SGD with elastic net regularization. 
Recall that a model regularized by elastic net has an objective function of the form 

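$$F(w) = L(w) + \lambda_1 \|w\|_1 + \frac{\lambda_2}{2} \|w\|_2^2.$$

When a feature $x_{ij} = 0$, the SGD update rule is

$$w_j^{(t+1)} = \operatorname{sgn}(w_j^{(t)}) \left[ |w_j^{(t)}| - \eta^{(t)} \lambda_1 - \eta^{(t)} \lambda_2 |w_j^{(t)}| \right]_+$$

$$\phantom{w_j^{(t+1)}} = \operatorname{sgn}(w_j^{(t)}) \left[ (1 - \eta^{(t)} \lambda_2) |w_j^{(t)}| - \eta^{(t)} \lambda_1 \right]_+ \qquad (3)$$

Theorem 1. To bring the weight $w_j$ current from time $\psi_j$ to time $k$ using repeated
Equation (3) updates, the constant time update is

$$w_j^{(k)} = \operatorname{sgn}(w_j^{(\psi_j)}) \left[ |w_j^{(\psi_j)}| \frac{P(k-1)}{P(\psi_j - 1)} - P(k-1) \cdot \lambda_1 \left( B(k-1) - B(\psi_j - 1) \right) \right]_+$$

where $P(t) = (1 - \eta^{(t)} \lambda_2) \cdot P(t-1)$ with base case $P(-1) = 1$ and $B(t) = \sum_{\tau=0}^{t}
\eta^{(\tau)} / P(\tau)$ with base case $B(-1) = 0$.

Proof. The time-varying learning rate prevents us from working out a simple expansion.
Instead, we can write the following inductive expression for consecutive terms in
the sequence:

$$w_j^{(t+1)} = \operatorname{sgn}(w_j^{(t)}) \left[ (1 - \eta^{(t)} \lambda_2) |w_j^{(t)}| - \eta^{(t)} \lambda_1 \right]_+ .$$

Writing $a_\tau = (1 - \eta^{(\tau)} \lambda_2)$ and $b_\tau = -\eta^{(\tau)} \lambda_1$ gives

$$w_j^{(t+1)} = \operatorname{sgn}(w_j^{(t)}) \left[ a_t |w_j^{(t)}| + b_t \right]_+$$

$$w_j^{(t+n)} = \operatorname{sgn}(w_j^{(t)}) \left[ a_{t+n-1} \left( \cdots a_{t+1} \left( a_t |w_j^{(t)}| + b_t \right) + b_{t+1} \cdots \right) + b_{t+n-1} \right]_+$$

$$\phantom{w_j^{(t+n)}} = \operatorname{sgn}(w_j^{(t)}) \left[ |w_j^{(t)}| \prod_{\tau=t}^{t+n-1} a_\tau + \sum_{\tau=t}^{t+n-2} b_\tau \left( \prod_{q=\tau+1}^{t+n-1} a_q \right) + b_{t+n-1} \right]_+ .$$

Clipping only once at the end is valid because, when every $a_\tau > 0$, a bracketed
quantity that has become negative remains negative under all further updates.

The leftmost term $\prod_{\tau=t}^{t+n-1} a_\tau$ can be calculated in constant time as $P(t+n-1)/P(t-1)$
using cached values from the dynamic programming scheme discussed in the previous
section. To cache the remaining terms, we group the center and rightmost terms
and apply the simplification

$$\sum_{\tau=t}^{t+n-2} b_\tau \left( \prod_{q=\tau+1}^{t+n-1} a_q \right) + b_{t+n-1}
= b_t \frac{P(t+n-1)}{P(t)} + b_{t+1} \frac{P(t+n-1)}{P(t+1)} + \cdots + b_{t+n-1} \frac{P(t+n-1)}{P(t+n-1)}$$

$$= -\lambda_1 P(t+n-1) \left( \frac{\eta^{(t)}}{P(t)} + \frac{\eta^{(t+1)}}{P(t+1)} + \cdots + \frac{\eta^{(t+n-1)}}{P(t+n-1)} \right).$$

We now add a new layer to the dynamic programming formulation. In addition to
precalculating all values $P(t)$ as we go, we define a partial sum over inverses of partial
products

$$B(t) = \sum_{\tau=0}^{t} \frac{\eta^{(\tau)}}{P(\tau)}.$$

Given that $P(t)$ can be computed in constant time at time $t$, $B(t)$ can now be cached
in constant time. With the base case $B(-1) = 0$, the dynamic programming here
depends upon the recurrence relation

$$B(t) = B(t-1) + \frac{\eta^{(t)}}{P(t)}.$$

Then, for SGD elastic net with decreasing learning rate, the update rule to apply
any number $n$ of consecutive regularization updates in constant time to weight $w_j$ is

$$w_j^{(t+n)} = \operatorname{sgn}(w_j^{(t)}) \left[ |w_j^{(t)}| \frac{P(t+n-1)}{P(t-1)} - \lambda_1 P(t+n-1) \left( B(t+n-1) - B(t-1) \right) \right]_+ .$$

□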
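
The elastic net update for SGD can be checked against a step-by-step application of Equation (3). The sketch below caches $P$ and $B$ in arrays shifted by one position, so that index 0 holds the base cases $P(-1) = 1$ and $B(-1) = 0$. It takes $B(t) = B(t-1) + \eta^{(t)}/P(t)$, an indexing chosen so that a single lazy step reproduces Equation (3) exactly; the helper names and the example schedule are assumptions of the sketch.

```python
import math


def cache_p_b(etas, lam2):
    """Return (P, B) with P[t + 1] = P(t), B[t + 1] = B(t), P[0] = 1, B[0] = 0."""
    P, B = [1.0], [0.0]
    for eta in etas:
        P.append(P[-1] * (1.0 - eta * lam2))
        B.append(B[-1] + eta / P[-1])      # eta_t / P(t), using the new P
    return P, B


def enet_lazy(w, psi, k, P, B, lam1):
    """Bring w from iteration psi to k in constant time."""
    mag = abs(w) * P[k] / P[psi] - lam1 * P[k] * (B[k] - B[psi])
    return math.copysign(max(mag, 0.0), w)


def enet_stepwise(w, psi, k, etas, lam1, lam2):
    """Reference: apply Equation (3) once per delayed iteration."""
    for t in range(psi, k):
        mag = (1.0 - etas[t] * lam2) * abs(w) - etas[t] * lam1
        w = math.copysign(max(mag, 0.0), w)
    return w


etas = [0.5 / math.sqrt(t + 1) for t in range(60)]
P, B = cache_p_b(etas, lam2=0.1)
lazy_value = enet_lazy(1.0, 0, 60, P, B, lam1=0.01)
```

Each lookup costs constant time, and the two cached arrays are the "two floating point numbers for each time step" whose storage grows linearly with the number of iterations.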


6 Lazy Updates for Forward Backward Splitting 


Here we turn our attention to FoBoS updates for $\ell_2^2$ and elastic net regularization. For
$\ell_2^2$ regularization, to apply the regularization update we solve the problem from
Equation (2) with $\lambda_1$ set to 0. Solving for the minimizer gives the update

$$w_j^{(t+1)} = \frac{w_j^{(t)}}{1 + \eta^{(t)} \lambda_2}$$

when $x_{ij} = 0$. Note that this differs from the standard stochastic gradient descent step.
We can store the values $\Phi(t) = \prod_{\tau=0}^{t} \frac{1}{1 + \eta^{(\tau)} \lambda_2}$. Then, the constant time lazy update for
FoBoS with $\ell_2^2$ regularization to bring a weight current at time $k$ from time $\psi_j$ is

$$w_j^{(k)} = w_j^{(\psi_j)} \frac{\Phi(k-1)}{\Phi(\psi_j - 1)}$$

where $\Phi(t) = (1 + \eta^{(t)} \lambda_2)^{-1} \cdot \Phi(t-1)$ with base case $\Phi(0) = \frac{1}{1 + \eta^{(0)} \lambda_2}$.

Finally, in the case of elastic net regularization via forward backward splitting, we
solve the convex optimization problem from Equation (2). This objective also comes
apart and can be optimized for each $w_j$ separately. Setting the derivative with respect
to $w_j$ to zero yields the solution

$$w_j^{(t+1)} = \operatorname{sgn}(w_j^{(t)}) \left[ \frac{|w_j^{(t)}| - \eta^{(t)} \lambda_1}{\eta^{(t)} \lambda_2 + 1} \right]_+ .$$


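Theorem 2. A constant-time lazy update for FoBoS with elastic net regularization and
decreasing learning rate to bring a weight current at time $k$ from time $\psi_j$ is

$$w_j^{(k)} = \operatorname{sgn}(w_j^{(\psi_j)}) \left[ |w_j^{(\psi_j)}| \frac{\Phi(k-1)}{\Phi(\psi_j - 1)} - \Phi(k-1) \cdot \lambda_1 \left( \beta(k-1) - \beta(\psi_j - 1) \right) \right]_+$$

where $\Phi(t) = \Phi(t-1) \cdot \frac{1}{1 + \eta^{(t)} \lambda_2}$ with base case $\Phi(-1) = 1$ and $\beta(t) = \beta(t-1) + \frac{\eta^{(t)}}{\Phi(t-1)}$
with base case $\beta(-1) = 0$.

Proof. Write $a_t = (\eta^{(t)} \lambda_2 + 1)^{-1}$ and $b_t = -\eta^{(t)} \lambda_1$. Note that neither $a_t$ nor $b_t$
depends upon $w_j$. Consider successive updates:

$$w_j^{(t+1)} = \operatorname{sgn}(w_j^{(t)}) \left[ a_t \left( |w_j^{(t)}| + b_t \right) \right]_+$$

$$w_j^{(t+n)} = \operatorname{sgn}(w_j^{(t)}) \left[ |w_j^{(t)}| \prod_{\beta=t}^{t+n-1} a_\beta + \sum_{\tau=t}^{t+n-1} \left( b_\tau \prod_{\alpha=\tau}^{t+n-1} a_\alpha \right) \right]_+ .$$

Inside the square brackets, $\frac{\Phi(t+n-1)}{\Phi(t-1)}$ can be substituted for $\prod_{\beta=t}^{t+n-1} a_\beta$ and the second
term can be expanded as

$$\sum_{\tau=t}^{t+n-1} \left( b_\tau \prod_{\alpha=\tau}^{t+n-1} a_\alpha \right) = \sum_{\tau=t}^{t+n-1} b_\tau \frac{\Phi(t+n-1)}{\Phi(\tau-1)}
= -\Phi(t+n-1) \cdot \lambda_1 \sum_{\tau=t}^{t+n-1} \frac{\eta^{(\tau)}}{\Phi(\tau-1)}.$$

Using the dynamic programming approach, for each time $t$, we calculate

$$\beta(t) = \beta(t-1) + \frac{\eta^{(t)}}{\Phi(t-1)}$$

with the base cases $\beta(0) = \eta^{(0)}$ and $\beta(-1) = 0$. Then

$$w_j^{(t+n)} = \operatorname{sgn}(w_j^{(t)}) \left[ |w_j^{(t)}| \frac{\Phi(t+n-1)}{\Phi(t-1)} - \Phi(t+n-1) \cdot \lambda_1 \left( \beta(t+n-1) - \beta(t-1) \right) \right]_+ .$$

□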
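
Theorem 2 can be checked in the same way as the SGD case. The sketch below caches $\Phi$ and $\beta$ in arrays shifted by one position, so that index 0 holds $\Phi(-1) = 1$ and $\beta(-1) = 0$, and compares the lazy update with repeated application of the single-step FoBoS elastic net solution. The helper names, the array offset, and the example schedule are assumptions of the sketch.

```python
import math


def cache_phi_beta(etas, lam2):
    """Return (Phi, beta) with Phi[t + 1] = Phi(t), beta[t + 1] = beta(t)."""
    Phi, beta = [1.0], [0.0]
    for eta in etas:
        beta.append(beta[-1] + eta / Phi[-1])     # eta_t / Phi(t - 1)
        Phi.append(Phi[-1] / (1.0 + eta * lam2))  # Phi(t)
    return Phi, beta


def fobos_lazy(w, psi, k, Phi, beta, lam1):
    """Bring w from iteration psi to k in constant time."""
    mag = abs(w) * Phi[k] / Phi[psi] - Phi[k] * lam1 * (beta[k] - beta[psi])
    return math.copysign(max(mag, 0.0), w)


def fobos_stepwise(w, psi, k, etas, lam1, lam2):
    """Reference: apply the single-step FoBoS elastic net solution repeatedly."""
    for t in range(psi, k):
        mag = (abs(w) - etas[t] * lam1) / (etas[t] * lam2 + 1.0)
        w = math.copysign(max(mag, 0.0), w)
    return w


etas = [0.5 / math.sqrt(t + 1) for t in range(60)]
Phi, beta = cache_phi_beta(etas, lam2=0.1)
lazy_value = fobos_lazy(-1.0, 3, 40, Phi, beta, lam1=0.01)
```

Note the order of the two appends: $\beta(t)$ divides by $\Phi(t-1)$, so it must be computed before $\Phi(t)$ is added to the cache.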


7 Experiments 

The usefulness of logistic regression with elastic net regularization is well-known. To
confirm the correctness and speed of the dynamic programming algorithm just
presented, we implemented it and tested it on a bag-of-words representation of abstracts
from biomedical articles indexed in Medline as described in [8]. The dataset contains
exactly $10^6$ examples, 260,941 features and an average of 88.54 nonzero features per
document.

We implemented algorithms in Python. Datasets are represented by standard sparse
SciPy matrices. We implemented both standard and lazy FoBoS for logistic regression
regularized with elastic net. We confirmed on a synthetic dataset that the standard
FoBoS updates and lazy updates output essentially identical weights. To make a fair
comparison, we also report results where the non-lazy algorithm exploits sparsity when
calculating predictions. Even when both methods exploit sparsity to calculate $\hat{y}$, lazy
updates lead to training over 1400 times faster. Note that sparse data structures must
be used even with dense updates, because a dense matrix to represent the input dataset
would use an unreasonable amount of memory.

          Lazy Updates    Dense Updates    Dense with Sparse Predictions
SGD       0.0102          21.377           14.381
FoBoS     0.0120          22.511           16.785

Table 1: Average time in seconds for each algorithm to process one example.

Logistic regression with lazy elastic net regularization runs approximately 2000
times faster than with dense regularization updates for both SGD and FoBoS. In the
absence of overhead, exploiting sparsity should yield a 2947x speedup. Clearly the
additional dynamic programming calculations do not erode the benefits of exploiting
sparsity. While the dynamic programming strategy consumes space linear in the number
of iterations, it does not present a major time penalty. Concerning space, storing
two floating point numbers for each time step $t$ is a modest use of space compared
to storing the data itself. Further, if space ever were a problem, all weights could
periodically be brought current. The cost of this update would be amortized across all
iterations and thus would be negligible.
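
The speedup figures quoted above follow from simple arithmetic on the dataset statistics and the timings in Table 1. The short sketch below reproduces them; the variable names are illustrative, and the numbers are copied from the text and the table.

```python
d = 260_941    # number of features
p = 88.54      # average number of nonzero features per document
ideal_speedup = d / p   # speedup if cost were exactly proportional to touched weights

# Seconds per example from Table 1: (lazy, dense, dense with sparse predictions).
seconds = {
    "SGD": (0.0102, 21.377, 14.381),
    "FoBoS": (0.0120, 22.511, 16.785),
}
speedup = {
    name: {"dense": dense / lazy, "dense_sparse": dense_sparse / lazy}
    for name, (lazy, dense, dense_sparse) in seconds.items()
}
```

The ideal ratio rounds to 2947. The measured ratios against dense updates are roughly 2100 for SGD and 1900 for FoBoS, and those against dense updates with sparse predictions are close to 1400 for both methods.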

8 Discussion 

Many interesting datasets are high-dimensional, and many high-dimensional datasets
are sparse. To be useful, learning algorithms should have time complexity that scales
with the number of non-zero feature values per example, as opposed to with the
nominal dimensionality. This paper provides algorithms for fast training of linear models
with $\ell_2^2$ or with elastic net regularization. Experiments confirm the correctness and
empirical benefit of the method. In future work we hope to use similar ideas to take
advantage of sparsity in nonlinear models, such as the sparsity provided by rectified
linear activation units in modern neural networks.